This is a quick demonstration of a standard latent dirichlet allocation (LDA) for topic modeling, geared toward those who want to dig a little deeper than merely getting a result from a package command. As the readme for this repo notes, I did this at one time and then tried to return to it a couple years later, so I’m not sure how clear it will come off. But the basic idea is to first simulate some documents, and then use the topicmodels package to recover the topics via LDA. You can find a basic overview of topic models in a variety of places, and I’ll eventually have a chapter in my graphical and latent variable models document.
Suffice it to say one can approach this in (at least) one of two ways. In one sense, LDA is a dimension reduction technique, much like the family of techniques that includes PCA, factor analysis, non-negative matrix factorization etc. We’ll take a whole lot of terms, loosely defined, and boil them down to a few topics. In this sense LDA is akin to discrete PCA. Another way to think about this is more from the perspective of factor analysis, where we are keenly interested in interpretation of the result, and want to know both what terms are associated with which topics, and what documents are more likely to present which topics. The following is the plate diagram and description for standard LDA from Wikipedia, which follows the typical depiction in Blei’s and other references.
Both \(z\) and \(w\) are both a multinomial draw based on the \(\theta\) and \(\varphi\) distributions respectively. For a given document, one draws a topic, and given the topic, words are drawn from it.
We start by generating what eventually will be a document-term matrix. Rows represent documents, columns terms. Terms are usually words but could be any n-gram of interest. In practice, this is where you’ll spend most of your time, as text is never ready, and must be scraped, converted, stemmed, cleaned etc. In what follows we’ll be generating the text based on the underlying topic model. We’ll essentially be fixing the priors to create \(\theta\) and \(\varphi\) noted above, then given those, draw topics and words given topics based on the multinomial distribution.
We begin the simulation by creating topic probabilities. There will be \(k=3\) topics. Half of our documents will have probabilities of topics for them (\(\theta_1\)) which will be notably different from the other half (\(\theta_2\)). Specifically, the first half will show higher probability of topic ‘A’ and ‘B’, while the second half of documents show higher probability of topic ‘C’. What we’ll end up with here is an \(m\) x \(k\) matrix of probabilities \(\theta\) where each \(m\) document has a non-zero probability for each \(k\) topic.
# Note that in what follows, I strive for conceptual clarity, not code efficiency.
library(tidyverse)
Ndocs = 500 # Number of documents
WordsPerDoc = rpois(Ndocs, 100) # Total words/terms in a document
thetaList = list(c(A=.60, B=.25, C=.15), # Topic proportions for first and second half of data
c(A=.10, B=.10, C=.80)) # These values represent a Dir(alpha) draw
theta_1 = t(replicate(Ndocs/2, thetaList[[1]]))
theta_2 = t(replicate(Ndocs/2, thetaList[[2]]))
theta = rbind(theta_1, theta_2) # Topic probabilities for all 500 docsWith topic probabilities in hand, we’ll draw topic assignments from a categorical distribution, which, for those not in computer science, is the multinomial with size=1 (see the commented line at the end).
firsthalf = 1:(Ndocs/2)
secondhalf = (Ndocs/2+1):Ndocs
Z = t(apply(theta, 1, rmultinom, n=1, size=1)) # draw topic assignment
colMeans(Z[firsthalf,]) # roughly equal to theta_1[1] 0.596 0.228 0.176
colMeans(Z[secondhalf,]) # roughly equal to theta_2[1] 0.104 0.092 0.804
z = apply(Z, 1, which.max) # topic assignment as arbitrary label 1:3
# z = apply(theta, 1, function(topprob) extraDistr::rcat(1, topprob)) # topic assignment via categorical distNext we need the topics themselves. Topics are probability distributions of terms, and in what follows we’ll use the Dirichlet distribution to provide the prior probabilities for the terms. With topic A, we’ll make the first ~40% of terms have a higher probability of occurring, the last ~40% go with topic C, and the middle more associated with topic B. To give a sense of the alpha settings, alpha=c(8,1,1) would result in topic probabilities of .8, .1, .1, as would alpha=c(80,10,10), though the latter would serve as a much stronger prior. We’ll use the gtools package for the rdirichlet function. I also provide a visualization, where the dark represents terms that are notably less likely to be associated with a particular topic.
library(gtools)
Nterms = max(WordsPerDoc)
breaks = quantile(1:Nterms, c(.4,.6,1)) %>% round()
cuts = list(1:breaks[1], (breaks[1]+1):breaks[2], (breaks[2]+1):Nterms)
phi_k = matrix(0, ncol=3, nrow=Nterms)
phi_k[,1] = rdirichlet(n=1, alpha=c(rep(10, length(cuts[[1]])), # topics for 1st 40% of terms
rep(1, length(cuts[[2]])),
rep(1, length(cuts[[3]]))))
phi_k[,2] = rdirichlet(n=1, alpha=c(rep(1, length(cuts[[1]])), # topics for middle 20%
rep(10, length(cuts[[2]])),
rep(1, length(cuts[[3]]))))
phi_k[,3] = rdirichlet(n=1, alpha=c(rep(1, length(cuts[[1]])), # topics for last 40%
rep(1, length(cuts[[2]])),
rep(10, length(cuts[[3]]))))Now, given the topic assignment, we draw words for each document according to its size via a multinomial draw.
wordlist_1 = sapply(1:Ndocs, function(i) t(rmultinom(1, WordsPerDoc[i], phi_k[,z[i]]))
, simplify = F)
# smash to doc-term matrix
ldadat_1 = do.call(rbind, wordlist_1)
colnames(ldadat_1) = paste0('word', 1:Nterms)
# bag of words representation
wordlist_1 = lapply(wordlist_1, function(wds) rep(paste0('word', 1:length(wds)), wds))
table(wordlist_1[[1]]) %>% as_data_frame() %>% arrange(desc(n)) # example document 1# A tibble: 58 × 2
Var1 n
<chr> <int>
1 word93 7
2 word82 6
3 word106 5
4 word117 5
5 word84 5
6 word99 5
7 word122 4
8 word125 4
9 word101 3
10 word102 3
# ... with 48 more rows
With a matrix approach, we don’t need an explicit topic label, just the topic indicator matrix and topic probabilities. I will let you see this for yourself that the results are similar, but we’ll just stick to the previous results to keep things clear.
ZB = tcrossprod(Z, phi_k)
ldadat_2 = t(sapply(1:Ndocs, function(i) rmultinom(1, WordsPerDoc[i], ZB[i,])))
wordlist_2 = apply(ldadat_2, 1, function(row) rep(paste0('word', which(row!=0)), row[row!=0]) )
colnames(ldadat_2) = paste0('word', 1:Nterms)We’re now ready to run the models. Depending on the number of documents, terms, and other settings, it is possible to get an unsatisfactory/random result. Usually a redo is enough to recover topic probabilities, but the commented settings are an attempt to get better result from the outset. However, they will notably slow things down.
library(topicmodels)
controlSettings = list(nstart=10, verbose=100)
# others var=list(iter.max=5000, tol=10e-8), em=list(iter.max=5000, tol=10e-8)
LDA_1 = LDA(ldadat_1, k=3, control=controlSettings)Let’s look at the results. To be clear, the topic labels are completely arbitrary, so what is topic “1” for one analysis might be topic “3” for another. In this case, they align such that A-1, B-2, and C-3. The main thing for us is the recovery of the distribution of the topics, which will be roughly 0.6, 0.25, 0.15 for the first half (arbitrary order), and 0.1, 0.1, 0.8 for the other. Visually we can see the expected result, the first half of the documents are mostly associated with the first topic, and a little more so for the second. The last half are mostly associated with the third topic and little else.
LDA_1Post = posterior(LDA_1)We can compare our results with the true topic probabilities.
| A | B | C | |
|---|---|---|---|
| true_topic_probs_first_half | 0.60 | 0.25 | 0.15 |
| true_topic_probs_second_half | 0.10 | 0.10 | 0.80 |
| LDA_1_first_half | 0.61 | 0.26 | 0.13 |
| LDA_1_second_half | 0.10 | 0.09 | 0.81 |
The tidytext package can facilitate working with LDA results, as well as doing all the text pre-processing beforehand. As it is relatively new to the scene, and I thought it might be useful to demonstrate some of how it works. The first step would assume your initial data is in ‘tidy’ format (or what is more typically called ‘long’ form). In this case, each row represents words and we have a column for how many times the word occurs, and the document id. Note that I drop 0 counts. This is because tidytext will not create at sparse dtm object otherwise.
library(tidytext)
lda_dat_df = ldadat_1 %>%
as_data_frame() %>%
mutate(doc=1:n()) %>%
gather(key=term, value=count, -doc) %>%
filter(count>0) %>%
arrange(doc)
lda_dat_df# A tibble: 26,086 × 3
doc term count
<int> <chr> <int>
1 1 word5 1
2 1 word7 2
3 1 word12 2
4 1 word28 1
5 1 word32 1
6 1 word33 2
7 1 word46 1
8 1 word53 1
9 1 word56 1
10 1 word58 1
# ... with 26,076 more rows
Again, tidytext will help you get to that state via the initial processing. Once there, you can then create the document term matrix with cast_dtm.
dtm = lda_dat_df %>%
cast_dtm(doc, term, count)
dtm<<DocumentTermMatrix (documents: 500, terms: 135)>>
Non-/sparse entries: 26086/41414
Sparsity : 61%
Maximal term length: 7
Weighting : term frequency (tf)
Once you have the topic model results, you can then start working with term and topic probabilities. First we’ll get the top 10 terms and graphically display them. Recall that topic C-3 should be seen with later words, and A-1 with earlier, and this is born out fairly clearly. Topic B was our less frequent topic with a smaller vocabulary. See also the LDAvis for an interactive way to examine topic and term relationships.
terms(LDA_1, 10)| Topic 1 | Topic 2 | Topic 3 |
|---|---|---|
| word22 | word67 | word95 |
| word44 | word76 | word100 |
| word15 | word63 | word120 |
| word49 | word74 | word94 |
| word37 | word71 | word86 |
| word1 | word77 | word115 |
| word11 | word57 | word126 |
| word45 | word72 | word93 |
| word34 | word58 | word96 |
| word12 | word56 | word132 |
top_terms = tidy(LDA_1) %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)dt_probs <- tidy(LDA_1, matrix = "gamma")
dt_probs %>%
group_by(document, topic) %>%
summarise(meanProb=mean(gamma)) %>%
head %>%
pander()| document | topic | meanProb |
|---|---|---|
| 1 | 1 | 0.0003704573 |
| 1 | 2 | 0.9992590854 |
| 1 | 3 | 0.0003704573 |
| 2 | 1 | 0.9991931432 |
| 2 | 2 | 0.0004034284 |
| 2 | 3 | 0.0004034284 |
A rough breakdown of expected topic probabilities across all documents is \(.6*Ndocs/2 + .1*Ndocs/2 =\) 0.35 for topic A, \(.25*Ndocs/2 + .1*Ndocs/2 =\) 0.175 for B, and \(.8*Ndocs/2 + .15*Ndocs/2 =\) 0.475 for C. We can use the topics function to extract the most likely topic per document and get the frequency of occurrence. Note that we don’t have to say that each document is only associated with one topic in general.
prop.table(table(topics(LDA_1)))
1 2 3
0.352 0.178 0.470
# dplyr is notably more verbose here, but can be used also
dt_probs %>%
group_by(document) %>%
summarise (topic = which.max(gamma)) %>%
group_by(topic) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n)) %>%
pander()| topic | n | freq |
|---|---|---|
| 1 | 176 | 0.352 |
| 2 | 89 | 0.178 |
| 3 | 235 | 0.470 |
Looks like we’re right where we’d expect.
This document conceptually demonstrates the mechanics of LDA’s data generating process. I suggest you go back and fiddle with the settings to see how things change. Note that just about everything you’ll come across with LDA will regard topic modeling, but one of its earliest uses was actually in genetics. Any time you have a matrix of counts like this, LDA is a potential candidate for analysis, and might be preferable to PCA.